Part 1: Introduction

Data Source

The AidData Lab at the College of William and Mary has led the work to probe into the mystery of Chinese foreign investment, publishing several versions of AidData’s Global Chinese Official Finance Dataset (2000-2014) with link attached below. In this study, we will try to learn more about the behaviors of official Chinese foreign investment using data wrangling, text mining, and machine learning techniques.

https://www.aiddata.org/data/chinese-global-official-finance-dataset

Staging the Data (R)

# Clear memory and run checks ---------------------------------------------

rm(list = ls())

# Load packages -----------------------------------------------------------

library(readxl)
library(dplyr)
library(ggplot2)
library(caret)
library(readxl)
library(tidytext)
library(wordcloud)
library(RColorBrewer)
library(robustbase)
library(gbm)
library(earth)
library(randomForest)

# Load and clean data -----------------------------------------------------

flow <- readxl::read_excel("GlobalChineseOfficialFinanceDataset_v1.0/GlobalChineseOfficialFinanceDataset_v1.0.xlsx")

flow <- flow %>%
# We will only include projects recommended for research 
  filter(recommended_for_research == TRUE) %>% 
# We will only include projects not part of others, which means umbrella == 0
  filter(umbrella == 0) %>% 
# We discard indistinguishable flow type projects
  filter(flow_class != "Vague (Official Finance)") %>% 
# We select only variables with effective numbers of values
  select(year, funding_agency, implementing_agency, 
         recipient_condensed, recipient_region,
         title, description, status, flow_class, 
         usd_defl_2014, usd_current, crs_sector_name, 
         start_actual, start_planned,
         end_actual, end_planned, 
         is_cofinanced)

Summarize and Visualize Data

After performing data cleaning, the dataset is left with 4013 useful projects with 17 variables to be further analyzed.

sum_data <- flow %>% summary()
sum_data
##       year      funding_agency     implementing_agency recipient_condensed
##  Min.   :2000   Length:4013        Length:4013         Length:4013        
##  1st Qu.:2006   Class :character   Class :character    Class :character   
##  Median :2009   Mode  :character   Mode  :character    Mode  :character   
##  Mean   :2008                                                             
##  3rd Qu.:2012                                                             
##  Max.   :2014                                                             
##                                                                           
##  recipient_region      title           description       
##  Length:4013        Length:4013        Length:4013       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     status           flow_class        usd_defl_2014      
##  Length:4013        Length:4013        Min.   :1.760e+02  
##  Class :character   Class :character   1st Qu.:1.018e+06  
##  Mode  :character   Mode  :character   Median :8.134e+06  
##                                        Mean   :1.253e+08  
##                                        3rd Qu.:5.650e+07  
##                                        Max.   :2.036e+10  
##                                        NA's   :1640       
##   usd_current        crs_sector_name     start_actual                
##  Min.   :1.400e+02   Length:4013        Min.   :1996-01-01 00:00:00  
##  1st Qu.:6.041e+05   Class :character   1st Qu.:2006-09-15 12:00:00  
##  Median :5.000e+06   Mode  :character   Median :2009-04-20 00:00:00  
##  Mean   :1.002e+08                      Mean   :2008-11-12 04:56:50  
##  3rd Qu.:3.990e+07                      3rd Qu.:2011-06-19 18:00:00  
##  Max.   :1.500e+10                      Max.   :2016-03-23 00:00:00  
##  NA's   :1640                           NA's   :3329                 
##  start_planned                   end_actual                 
##  Min.   :2000-01-01 00:00:00   Min.   :2000-01-01 00:00:00  
##  1st Qu.:2006-07-16 12:00:00   1st Qu.:2006-08-17 00:00:00  
##  Median :2008-09-16 00:00:00   Median :2009-08-30 12:00:00  
##  Mean   :2008-08-07 01:00:27   Mean   :2009-02-06 06:14:19  
##  3rd Qu.:2011-02-04 18:00:00   3rd Qu.:2011-10-30 12:00:00  
##  Max.   :2015-11-23 00:00:00   Max.   :2016-03-31 00:00:00  
##  NA's   :3751                  NA's   :3159                 
##   end_planned                  is_cofinanced  
##  Min.   :2000-03-26 00:00:00   Mode :logical  
##  1st Qu.:2008-06-23 12:00:00   FALSE:3835     
##  Median :2011-09-03 12:00:00   TRUE :178      
##  Mean   :2011-06-16 02:22:56                  
##  3rd Qu.:2014-08-04 00:00:00                  
##  Max.   :2020-09-09 00:00:00                  
##  NA's   :3741

Total Projects’ Number by Years

pro_num_year <- flow %>% 
  select(year, usd_defl_2014) %>% 
  group_by(year) %>%
  count()
colnames(pro_num_year) = c("Year", 
                           "Number of Projects")
pro_num_year
## # A tibble: 15 x 2
## # Groups:   year [15]
##     Year `Number of Projects`
##    <dbl>                <int>
##  1  2000                  101
##  2  2001                  135
##  3  2002                  134
##  4  2003                  181
##  5  2004                  181
##  6  2005                  242
##  7  2006                  314
##  8  2007                  327
##  9  2008                  278
## 10  2009                  383
## 11  2010                  342
## 12  2011                  390
## 13  2012                  336
## 14  2013                  342
## 15  2014                  327
gg1 <- ggplot(flow) + 
    geom_histogram(mapping = aes(x = year), 
                   binwidth = 0.5, color = "darkblue", fill = "lightblue") +
    ggtitle("Number of Projects by Years") + 
    theme(plot.title = element_text(hjust = 0.5, face="bold")) +
    xlab("Year") + ylab("Number of Projects")
gg1

The data with the graph show an overall increasing pattern for the number of projects by years, with stepbacks on 2008 due to the recession, on 2010, and 2014.

Snapshot of projects’ value by 2014 USD by years, including the total, average, min, and max value

pro_value_year <- flow %>% 
  select(year, usd_defl_2014) %>% 
  group_by(year) %>%
  summarize(sum = sum(usd_defl_2014, na.rm = T),
            avg = mean(usd_defl_2014, na.rm = T),
            min = min(usd_defl_2014, na.rm = T),
            max = max(usd_defl_2014, na.rm = T)) %>% 
  arrange(year)
colnames(pro_value_year) = c("Year", 
                             "Total Value of Project",
                             "Average Value of Project",
                             "Minimum Value of Project",
                             "Maximum Value of Project")
left_join(pro_num_year, pro_value_year, by = "Year")
## # A tibble: 15 x 6
## # Groups:   Total Value of Project [15]
##     Year `Number of Proj… `Total Value of… `Average Value …
##    <dbl>            <int>            <dbl>            <dbl>
##  1  2000              101      2337247008.        32461764.
##  2  2001              135      5455489633.        50050364.
##  3  2002              134      4376162094.        46064864.
##  4  2003              181      5525240301.        46824070.
##  5  2004              181      4520528562.        40004678.
##  6  2005              242      8045518295.        49663693.
##  7  2006              314     13831559135.        73965557.
##  8  2007              327     15695022812.        75821366.
##  9  2008              278      8530799541.        55756860.
## 10  2009              383     64117820118.       276369914.
## 11  2010              342     24826825668.       134928400.
## 12  2011              390     44008421953.       202803788.
## 13  2012              336     37135252439.       200731094.
## 14  2013              342     27958099383.       158852837.
## 15  2014              327     30997892020.       190171117.
## # … with 2 more variables: `Minimum Value of Project` <dbl>, `Maximum
## #   Value of Project` <dbl>
yearSum <- flow %>% 
    group_by(year) %>%
    summarize(sum = sum(usd_defl_2014, na.rm = TRUE))
ggplot(yearSum, mapping = aes(x = year, y = sum)) + 
    geom_bar(stat = "identity") +
    ggtitle("Graph 1: Total Value of Projects by Years") + 
    theme(plot.title = element_text(hjust = 0.5, face="bold")) +
    xlab("Year") + ylab("Total Value of Projects")

gg3 <- ggplot(flow, 
              mapping = aes(x = year, y = usd_defl_2014, 
                            group = year, color = year)) + 
    geom_point(na.rm = TRUE) +
    ggtitle("Graph 2: Total Value of Projects by Years") + 
    theme(plot.title = element_text(hjust = 0.5, face="bold")) +
    xlab("Year") + ylab("Total Value of Projects")
gg3

The data show an overall increasing pattern for the total value of projects by years, with a few stepbacks along the way. Graph 1 displays a clearer picture of the overall pattern as we can see the highest flow happened in 2009, which followed the sharp recession in 2008 and followed by a sharp recession in 2010. Graph 2 indicates that there are some outliers among projects in years, such as 2009, which helps to explain the reason why there is a peak in 2009 for the total value of projects.

Total projects’ number by regions

pro_num_region <- flow %>% 
  select(recipient_region, usd_defl_2014) %>% 
  group_by(flow$recipient_region) %>%
  count() %>% 
  arrange(desc(n))
colnames(pro_num_region) = c("Region", 
                             "Number of Projects")
pro_num_region
## # A tibble: 6 x 2
## # Groups:   flow$recipient_region [6]
##   Region                          `Number of Projects`
##   <chr>                                          <int>
## 1 Africa                                          2168
## 2 Asia                                            1035
## 3 Latin America and the Caribbean                  295
## 4 The Pacific                                      256
## 5 Central and Eastern Europe                       166
## 6 Middle East                                       93
gg4 <- ggplot(flow) + 
    geom_histogram(mapping = aes(x = recipient_region), 
                   color = "darkblue", fill = "lightblue", stat = "count") +
    ggtitle("Number of Projects by Regions") + 
    theme(plot.title = element_text(hjust = 0.5, face="bold")) +
    xlab("Region") + ylab("Number of Projects") + 
    scale_x_discrete(labels = abbreviate)
gg4

From the data, we can see that Africa receives the most number of projects. From the graph, we can see that Africa has more than twice as many projects as Asia, the region with the second most projects.

Total projects’ value by 2014 USD by regions

pro_value_region <- flow %>% 
  select(recipient_region, usd_defl_2014) %>% 
  group_by(flow$recipient_region) %>%
  summarize(sum = sum(usd_defl_2014, na.rm = T)) %>% 
  arrange(desc(sum))
colnames(pro_value_region) = c("Region", 
                               "Total Value of Projects")
pro_value_region
## # A tibble: 6 x 2
##   Region                          `Total Value of Projects`
##   <chr>                                               <dbl>
## 1 Asia                                         95822989635.
## 2 Africa                                       92713211229.
## 3 Central and Eastern Europe                   55569462403.
## 4 Latin America and the Caribbean              50403066261.
## 5 The Pacific                                   2337839964.
## 6 Middle East                                    515309469.
regionSum <- flow %>% group_by(recipient_region) %>%
  summarize(sum = sum(usd_defl_2014, na.rm = TRUE))
pie(regionSum$sum, labels = regionSum$recipient_region, 
    radius = 1, cex = 0.5, main = "Total Value of Projects by Regions") 

library(plotly)
plot_ly(regionSum, labels = ~ recipient_region, 
        values = ~ sum, type = 'pie',
        textposition = 'outside',textinfo = 'label') %>%
  layout(title = 'Total Value of Projects by Regions',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, 
                      showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, 
                      showticklabels = FALSE))

From the data, we can see that Asia receives the highest value of projects. It is evident to see in the pie chart that Asia and Africa have almost equal, significant share of the total value of projects, while the Pacific and Middle East are the regions with the least share of value.

Average projects’ value by 2014 USD by regions

avg_pro_value_region <- left_join(pro_num_region, 
                                  pro_value_region, 
                                  by = "Region") %>% 
  mutate(`Average Project Value` = `Total Value of Projects`/`Number of Projects`) %>% 
  arrange(desc(`Average Project Value`)) %>% 
  select(Region, `Average Project Value`)
avg_pro_value_region
## # A tibble: 6 x 3
## # Groups:   Total Value of Projects [6]
##   `Total Value of Project… Region                     `Average Project Val…
##                      <dbl> <chr>                                      <dbl>
## 1             55569462403. Central and Eastern Europe            334755798.
## 2             50403066261. Latin America and the Car…            170857852.
## 3             95822989635. Asia                                   92582599.
## 4             92713211229. Africa                                 42764396.
## 5              2337839964. The Pacific                             9132187.
## 6               515309469. Middle East                             5540962.

Combining data from the number and the total value of projects by regions, Central and Eastern Europe is the region that receives the highest average value of projects.

Total projects’ number by 2014 USD by countries

pro_num_country <- flow %>% 
  group_by(flow$recipient_condensed) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  head(10)
colnames(pro_num_country) = c("Country", 
                              "Total Number of Projects")
pro_num_country
## # A tibble: 10 x 2
## # Groups:   flow$recipient_condensed [10]
##    Country   `Total Number of Projects`
##    <chr>                          <int>
##  1 Cambodia                         166
##  2 Zimbabwe                         115
##  3 Pakistan                         108
##  4 Angola                           104
##  5 Tanzania                          94
##  6 Ghana                             86
##  7 Liberia                           83
##  8 Uganda                            80
##  9 Kenya                             78
## 10 Sri Lanka                         78

Cambodia ranks the first in terms of the number of projects carried out in the country.

Total projects’ value by 2014 USD by countries

pro_value_country <- flow %>% 
  group_by(flow$recipient_condensed) %>% 
  summarize(sum = sum(usd_defl_2014, na.rm = T)) %>% 
  arrange(desc(sum)) %>% 
  head(10)
colnames(pro_value_country) = c("Country", 
                                "Total Value of Projects")
pro_value_country
## # A tibble: 10 x 2
##    Country      `Total Value of Projects`
##    <chr>                            <dbl>
##  1 Russia                    36622577666.
##  2 Pakistan                  18679406540.
##  3 Angola                    14293955203.
##  4 Laos                      11632573144.
##  5 Sri Lanka                 11014082678.
##  6 Venezuela                 10768596440 
##  7 Turkmenistan              10676427454.
##  8 Ecuador                    9698347826.
##  9 Brazil                     8528111506.
## 10 Cambodia                   7973834856.

In terms of total value of the projects, Cambodia only ranks as number 10 in the table, while Russia receives Chinese foreign investment the most money-wise.

Average projects’ value by 2014 USD by countries

pro_num_country <- flow %>% 
  group_by(flow$recipient_condensed) %>% 
  count() %>% 
  arrange(desc(n)) 
colnames(pro_num_country) = c("Country", 
                              "Total Number of Projects")
pro_value_country <- flow %>% 
  group_by(flow$recipient_condensed) %>% 
  summarize(sum = sum(usd_defl_2014, na.rm = T)) %>% 
  arrange(desc(sum)) 
colnames(pro_value_country) = c("Country", 
                                "Total Value of Projects")
avg_pro_value_country <- left_join(pro_num_country, 
                                   pro_value_country, 
                                   by = "Country") %>% 
  mutate(`Average Project Value` = `Total Value of Projects`/`Total Number of Projects`) %>% 
  arrange(desc(`Average Project Value`)) %>% 
  select(Country, `Average Project Value`) %>% 
  head(10)
avg_pro_value_country
## # A tibble: 10 x 3
## # Groups:   Total Value of Projects [10]
##    `Total Value of Projects` Country      `Average Project Value`
##                        <dbl> <chr>                          <dbl>
##  1              36622577666. Russia                   2441505178.
##  2               6681307048. Cuba                     1670326762.
##  3              10676427454. Turkmenistan             1334553432.
##  4               4637474150. Argentina                 772912358.
##  5               8528111506. Brazil                    568540767.
##  6              10768596440  Venezuela                 468199845.
##  7               9698347826. Ecuador                   461826087.
##  8               6735422689. Kazakhstan                420963918.
##  9               5569237484. India                     397802677.
## 10               2878552493. Bahamas                   319839166.

This table shows the evidence that Russia receives the highest average value per project, which is likely to be one of the major reasons behind why Central and Eastern Europe is the region that receives the highest average value of projects from China.

Part 2: Texting Mining

title <- tibble(txt = flow$title)
description <- tibble(txt = flow$description)

Text Mining for Titles of the Projects

title <- title %>% 
    unnest_tokens(word, txt) %>% 
    anti_join(stop_words) %>%   
    count(word, sort = TRUE)
title
## # A tibble: 4,839 x 2
##    word             n
##    <chr>        <int>
##  1 china         2785
##  2 million        982
##  3 donates        745
##  4 linked         581
##  5 usd            571
##  6 chinese        437
##  7 project        418
##  8 medical        365
##  9 grants         351
## 10 construction   349
## # … with 4,829 more rows
title %>% 
    with(wordcloud(word, n, max.words = 100, 
                   colors = brewer.pal(8, "Dark2")))

There is a snapshot of topics mentioned in the tile of all projects. Key words, such as donates, medical, and construction, are worth paying attention.

Text Mining for Descriptions of the Projects

description <- description %>% 
    unnest_tokens(word, txt) %>% 
    anti_join(stop_words) %>% 
    count(word, sort = TRUE)
description
## # A tibble: 19,002 x 2
##    word             n
##    <chr>        <int>
##  1 china         4929
##  2 chinese       3899
##  3 project       3350
##  4 million       3079
##  5 government    1759
##  6 signed        1702
##  7 loan          1660
##  8 construction  1649
##  9 usd           1483
## 10 agreement     1380
## # … with 18,992 more rows
description %>% 
    with(wordcloud(word, n, max.words = 100, 
                   colors = brewer.pal(8, "Dark2")))

There is a snapshot of topics mentioned in the description of all projects. Key words, such as contruction, equipment, development, bank, medical, ambassador, and government, are worth paying attention.

Part 3: Predictive Models

Predictive Question

Whether a project is ODA or not?

Term Explaination

ODA means Official Development Assistance, the major type of official development finance, as opposed to Other Official Flows (OOF). The Global Chinese Official Finance Dataset includes a datset on China’s Official Development Assistance (ODA)-like projects and a dataset on China’s Other Official Flows (OOF)-like projects. The reason projects are divided into ODA-like, OOF-like is that ODA and OOF are OECD countries definition. ODAs are traditionally referred as aids, whereas OOF are just investment. Since Chinese aids and investment do not use such a standard, researchers at AidData have to discern which are which by themselves so that the data could be comparable to western investment and aids. There are about 4000 plus ODA projects and 1100 OOF projects covering the timespan of 2000 to 2014.

Predictive Models

Rationale: Since the outcome variable in this predictive model is binary, this study chooses following models to run.

Steps: For each model, it follows four steps: 1) Create new variables and split data into training and testing data; 2) Optimize the model through training data; 3) Predict the model using testing data; 4) Evaluate the model by comparing training and testing data

Data Cleaning

flow_type_dummy <- ifelse(flow$flow_class == "ODA-like", 1, 0)
is_cofinanced_dummy <- ifelse(flow$is_cofinanced == TRUE, 1, 0)
new_flow <- cbind(flow, flow_type_dummy, 
                  is_cofinanced_dummy) %>% 
    data.frame()
new_flow <- new_flow %>% 
  filter(is.na(usd_defl_2014) == FALSE) %>% 
  filter(is.na(usd_current) == FALSE)
new_flow$flow_type_dummy <- factor(new_flow$flow_type_dummy, 
                                   labels = c("isODA", "notODA"), 
                                   levels = 1:0) 
set.seed(12345)
in_train <- createDataPartition(y = new_flow$flow_type_dummy, 
                                p = 0.8, list = FALSE)
training <- new_flow[ in_train, ]
testing  <- new_flow[-in_train, ]

Logit Model

logit <- glm(training$flow_type_dummy == "isODA" ~ 
               usd_defl_2014 + usd_current + 
               as.factor(recipient_region) + as.factor(year) + 
               as.factor(crs_sector_name) + is_cofinanced_dummy,
             data = training, family = binomial(link = "logit"))

y_hat_logit <- predict(logit, newdata = testing, type = "response")
z_logit <- factor(y_hat_logit > 0.5, 
                  levels = c(TRUE, FALSE), 
                  labels = c("isODA", "notODA"))

confusionMatrix(z_logit, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction isODA notODA
##     isODA    365     57
##     notODA    10     42
##                                          
##                Accuracy : 0.8586         
##                  95% CI : (0.824, 0.8887)
##     No Information Rate : 0.7911         
##     P-Value [Acc > NIR] : 0.0001027      
##                                          
##                   Kappa : 0.4817         
##                                          
##  Mcnemar's Test P-Value : 1.912e-08      
##                                          
##             Sensitivity : 0.9733         
##             Specificity : 0.4242         
##          Pos Pred Value : 0.8649         
##          Neg Pred Value : 0.8077         
##              Prevalence : 0.7911         
##          Detection Rate : 0.7700         
##    Detection Prevalence : 0.8903         
##       Balanced Accuracy : 0.6988         
##                                          
##        'Positive' Class : isODA          
## 

Penalized Logit Model

ctrl_logit <- trainControl(method = "repeatedcv", repeats = 3, 
                     classProbs = TRUE, summaryFunction = twoClassSummary)
Grid_logit <- expand.grid(.lambda = seq(0, 1, length = 10),
                         .alpha = seq(0, 1, length = 10))

penalized_logit <- train(flow_type_dummy ~ 
                           usd_defl_2014 + usd_current + 
                           as.factor(recipient_region) + as.factor(year) + 
                           as.factor(crs_sector_name) + is_cofinanced_dummy, 
                         data = training, method = "glmnet",
                         trControl = ctrl_logit, tuneGrid = Grid_logit,
                         metric = "ROC", preProcess = c("center", "scale"))

z_penalized_logit <- predict(penalized_logit, newdata = testing) 

confusionMatrix(z_penalized_logit, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction isODA notODA
##     isODA    365     56
##     notODA    10     43
##                                           
##                Accuracy : 0.8608          
##                  95% CI : (0.8263, 0.8906)
##     No Information Rate : 0.7911          
##     P-Value [Acc > NIR] : 6.231e-05       
##                                           
##                   Kappa : 0.4918          
##                                           
##  Mcnemar's Test P-Value : 3.040e-08       
##                                           
##             Sensitivity : 0.9733          
##             Specificity : 0.4343          
##          Pos Pred Value : 0.8670          
##          Neg Pred Value : 0.8113          
##              Prevalence : 0.7911          
##          Detection Rate : 0.7700          
##    Detection Prevalence : 0.8882          
##       Balanced Accuracy : 0.7038          
##                                           
##        'Positive' Class : isODA           
## 

Linear Probability Model (LPM)

ols <- lm(training$flow_type_dummy == "isODA" ~ 
            usd_defl_2014 + usd_current + 
            as.factor(recipient_region) + as.factor(year) + 
            as.factor(crs_sector_name) + is_cofinanced_dummy, 
          data = training)

y_hat_ols <- predict(ols, newdata = testing) 
z_ols <- factor(y_hat_ols > 0.5, 
                levels = c(TRUE, FALSE), 
                labels = c("isODA", "notODA")) 

confusionMatrix(z_ols, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction isODA notODA
##     isODA    366     62
##     notODA     9     37
##                                           
##                Accuracy : 0.8502          
##                  95% CI : (0.8149, 0.8811)
##     No Information Rate : 0.7911          
##     P-Value [Acc > NIR] : 0.0006452       
##                                           
##                   Kappa : 0.4355          
##                                           
##  Mcnemar's Test P-Value : 6.775e-10       
##                                           
##             Sensitivity : 0.9760          
##             Specificity : 0.3737          
##          Pos Pred Value : 0.8551          
##          Neg Pred Value : 0.8043          
##              Prevalence : 0.7911          
##          Detection Rate : 0.7722          
##    Detection Prevalence : 0.9030          
##       Balanced Accuracy : 0.6749          
##                                           
##        'Positive' Class : isODA           
## 

Robust LPM

ols <- lm(training$flow_type_dummy == "isODA" ~ 
            usd_defl_2014 + usd_current + 
            as.factor(recipient_region) + as.factor(year) + 
            as.factor(crs_sector_name) + is_cofinanced_dummy, 
          data = training)



robust <- lmrob(training$flow_type_dummy == "isODA" ~ 
                      usd_defl_2014 + usd_current + 
                      as.factor(recipient_region) + as.factor(year) + 
                      as.factor(crs_sector_name) + is_cofinanced_dummy, 
                    data = training)

y_hat_robust <- predict(robust, newdata = testing) 
z_robust <- factor(y_hat_ols > 0.5, 
                   levels = c(TRUE, FALSE), 
                   labels = c("isODA", "notODA")) 

confusionMatrix(z_robust, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction isODA notODA
##     isODA    366     62
##     notODA     9     37
##                                           
##                Accuracy : 0.8502          
##                  95% CI : (0.8149, 0.8811)
##     No Information Rate : 0.7911          
##     P-Value [Acc > NIR] : 0.0006452       
##                                           
##                   Kappa : 0.4355          
##                                           
##  Mcnemar's Test P-Value : 6.775e-10       
##                                           
##             Sensitivity : 0.9760          
##             Specificity : 0.3737          
##          Pos Pred Value : 0.8551          
##          Neg Pred Value : 0.8043          
##              Prevalence : 0.7911          
##          Detection Rate : 0.7722          
##    Detection Prevalence : 0.9030          
##       Balanced Accuracy : 0.6749          
##                                           
##        'Positive' Class : isODA           
## 

LDA

LDA <- train(flow_type_dummy ~ 
               usd_defl_2014 + usd_current + 
               as.factor(recipient_region) + as.factor(year) + 
               as.factor(crs_sector_name) + is_cofinanced_dummy, 
             data = training, method = "lda", 
             preProcess = c("center", "scale"))

z_LDA <- predict(LDA, newdata = testing)

confusionMatrix(z_LDA, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction isODA notODA
##     isODA    354     54
##     notODA    21     45
##                                           
##                Accuracy : 0.8418          
##                  95% CI : (0.8057, 0.8735)
##     No Information Rate : 0.7911          
##     P-Value [Acc > NIR] : 0.0031562       
##                                           
##                   Kappa : 0.4543          
##                                           
##  Mcnemar's Test P-Value : 0.0002199       
##                                           
##             Sensitivity : 0.9440          
##             Specificity : 0.4545          
##          Pos Pred Value : 0.8676          
##          Neg Pred Value : 0.6818          
##              Prevalence : 0.7911          
##          Detection Rate : 0.7468          
##    Detection Prevalence : 0.8608          
##       Balanced Accuracy : 0.6993          
##                                           
##        'Positive' Class : isODA           
## 

QDA

QDA <- train(flow_type_dummy ~ 
               usd_defl_2014 + usd_current, 
             data = training, method = "qda", 
             preProcess = c("center", "scale"))

z_QDA <- predict(QDA, newdata = testing)

confusionMatrix(z_QDA, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction isODA notODA
##     isODA    369     78
##     notODA     6     21
##                                           
##                Accuracy : 0.8228          
##                  95% CI : (0.7854, 0.8561)
##     No Information Rate : 0.7911          
##     P-Value [Acc > NIR] : 0.0486          
##                                           
##                   Kappa : 0.2678          
##                                           
##  Mcnemar's Test P-Value : 9.429e-15       
##                                           
##             Sensitivity : 0.9840          
##             Specificity : 0.2121          
##          Pos Pred Value : 0.8255          
##          Neg Pred Value : 0.7778          
##              Prevalence : 0.7911          
##          Detection Rate : 0.7785          
##    Detection Prevalence : 0.9430          
##       Balanced Accuracy : 0.5981          
##                                           
##        'Positive' Class : isODA           
## 

PLSDA

ctrl_PLSDA <- trainControl(method = "repeatedcv", repeats = 2, 
                     classProbs = TRUE, summaryFunction = twoClassSummary)
PLSDA_grid <- expand.grid(.ncomp = 1:2)

PLSDA <- train(flow_type_dummy ~ 
                 usd_defl_2014 + usd_current + 
                 as.factor(recipient_region) + as.factor(year) + 
                 as.factor(crs_sector_name) + is_cofinanced_dummy, 
               data = training, method = "pls", 
               preProcess = c("center", "scale"), metric = "ROC", 
               trControl = ctrl_PLSDA, tuneGrid = PLSDA_grid)

z_PLSDA <- predict(PLSDA, newdata = testing)

confusionMatrix(z_PLSDA, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction isODA notODA
##     isODA    372     72
##     notODA     3     27
##                                           
##                Accuracy : 0.8418          
##                  95% CI : (0.8057, 0.8735)
##     No Information Rate : 0.7911          
##     P-Value [Acc > NIR] : 0.003156        
##                                           
##                   Kappa : 0.356           
##                                           
##  Mcnemar's Test P-Value : 4.096e-15       
##                                           
##             Sensitivity : 0.9920          
##             Specificity : 0.2727          
##          Pos Pred Value : 0.8378          
##          Neg Pred Value : 0.9000          
##              Prevalence : 0.7911          
##          Detection Rate : 0.7848          
##    Detection Prevalence : 0.9367          
##       Balanced Accuracy : 0.6324          
##                                           
##        'Positive' Class : isODA           
## 

Nearest Shrunken Centroids

ctrl_NSC <- trainControl(method = "repeatedcv", repeats = 3, 
                     classProbs = TRUE, summaryFunction = twoClassSummary)
grid_NSC <- data.frame(.threshold = 0:25)

NSC <- train(flow_type_dummy ~ 
               usd_defl_2014 + usd_current + 
               as.factor(recipient_region) + as.factor(year) + 
               as.factor(crs_sector_name) + is_cofinanced_dummy, 
             data = training, method = "pam", 
             preProcess = c("center", "scale"), metric = "ROC", 
             trControl = ctrl_NSC, tuneGrid = grid_NSC)
## 1111111111111111111111111111111
z_NSC <- predict(NSC, newdata = testing) 

confusionMatrix(z_NSC, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction isODA notODA
##     isODA    375     98
##     notODA     0      1
##                                          
##                Accuracy : 0.7932         
##                  95% CI : (0.754, 0.8288)
##     No Information Rate : 0.7911         
##     P-Value [Acc > NIR] : 0.4818         
##                                          
##                   Kappa : 0.0159         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 1.0000         
##             Specificity : 0.0101         
##          Pos Pred Value : 0.7928         
##          Neg Pred Value : 1.0000         
##              Prevalence : 0.7911         
##          Detection Rate : 0.7911         
##    Detection Prevalence : 0.9979         
##       Balanced Accuracy : 0.5051         
##                                          
##        'Positive' Class : isODA          
## 

Boosting

gbmControl = trainControl(method="cv", 
                          number=5, returnResamp = "all")

gbm = train(flow_type_dummy ~ 
              usd_defl_2014 + usd_current +
              as.factor(recipient_region) + as.factor(year) + 
              as.factor(crs_sector_name) + is_cofinanced_dummy,
            data=training, method="gbm",
            distribution="bernoulli", 
            trControl=gbmControl, verbose=F, 
            tuneGrid=data.frame(.n.trees=seq(100, 1000, 
                                              by = 100), 
                                 .shrinkage=c(0.01, 0.1), 
                                 .interaction.depth=1:2, 
                                 .n.minobsinnode=1:5),
             na.action = na.omit)

y_hat_gbm <- predict(gbm, newdata = testing, na.action = na.pass)

confusionMatrix(y_hat_gbm, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction isODA notODA
##     isODA    354     47
##     notODA    21     52
##                                           
##                Accuracy : 0.8565          
##                  95% CI : (0.8217, 0.8868)
##     No Information Rate : 0.7911          
##     P-Value [Acc > NIR] : 0.0001666       
##                                           
##                   Kappa : 0.5195          
##                                           
##  Mcnemar's Test P-Value : 0.0024318       
##                                           
##             Sensitivity : 0.9440          
##             Specificity : 0.5253          
##          Pos Pred Value : 0.8828          
##          Neg Pred Value : 0.7123          
##              Prevalence : 0.7911          
##          Detection Rate : 0.7468          
##    Detection Prevalence : 0.8460          
##       Balanced Accuracy : 0.7346          
##                                           
##        'Positive' Class : isODA           
## 

Random Forest

rf<- randomForest(flow_type_dummy ~ 
                    usd_defl_2014 + usd_current + 
                    is_cofinanced_dummy, 
                  data=training,
                  importance = TRUE,
                  na.action = na.omit)

y_hat_rf <- predict(rf, newdata = testing,
                 type = "response", na.action = na.pass)

confusionMatrix(y_hat_rf, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction isODA notODA
##     isODA    349     51
##     notODA    26     48
##                                           
##                Accuracy : 0.8376          
##                  95% CI : (0.8012, 0.8696)
##     No Information Rate : 0.7911          
##     P-Value [Acc > NIR] : 0.006386        
##                                           
##                   Kappa : 0.4581          
##                                           
##  Mcnemar's Test P-Value : 0.006237        
##                                           
##             Sensitivity : 0.9307          
##             Specificity : 0.4848          
##          Pos Pred Value : 0.8725          
##          Neg Pred Value : 0.6486          
##              Prevalence : 0.7911          
##          Detection Rate : 0.7363          
##    Detection Prevalence : 0.8439          
##       Balanced Accuracy : 0.7078          
##                                           
##        'Positive' Class : isODA           
## 

Neural Network Model

nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10))
ctrl_nn <- trainControl(method = "cv", number = 10)

nn <- train(flow_type_dummy ~ 
              usd_defl_2014 + usd_current + 
              as.factor(recipient_region) + as.factor(year) + 
              as.factor(crs_sector_name) + is_cofinanced_dummy, 
            data = training, method = "nnet",
            trControl = ctrl_nn, tuneGrid = nnetGrid,
            preProcess = c("center", "scale"), trace = FALSE)

y_hat_nn <- predict(nn, newdata = testing)

confusionMatrix(y_hat_nn, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction isODA notODA
##     isODA    353     48
##     notODA    22     51
##                                          
##                Accuracy : 0.8523         
##                  95% CI : (0.8171, 0.883)
##     No Information Rate : 0.7911         
##     P-Value [Acc > NIR] : 0.0004175      
##                                          
##                   Kappa : 0.5053         
##                                          
##  Mcnemar's Test P-Value : 0.0028074      
##                                          
##             Sensitivity : 0.9413         
##             Specificity : 0.5152         
##          Pos Pred Value : 0.8803         
##          Neg Pred Value : 0.6986         
##              Prevalence : 0.7911         
##          Detection Rate : 0.7447         
##    Detection Prevalence : 0.8460         
##       Balanced Accuracy : 0.7282         
##                                          
##        'Positive' Class : isODA          
## 

MARS model

ctrl_mars <- trainControl(method = "cv", number = 10)
marsGrid <- expand.grid(.degree = 1:3, .nprune = 1:10)

mars <- train(flow_type_dummy ~ usd_defl_2014 + usd_current + 
                as.factor(recipient_region) + as.factor(year) + 
                as.factor(crs_sector_name) + is_cofinanced_dummy,  
              data = training, method = "earth", 
              trControl = ctrl_mars, tuneGrid = marsGrid)

y_hat_mars <- predict(mars, newdata = testing)

confusionMatrix(y_hat_mars, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction isODA notODA
##     isODA    356     59
##     notODA    19     40
##                                           
##                Accuracy : 0.8354          
##                  95% CI : (0.7989, 0.8677)
##     No Information Rate : 0.7911          
##     P-Value [Acc > NIR] : 0.00889         
##                                           
##                   Kappa : 0.4151          
##                                           
##  Mcnemar's Test P-Value : 1.006e-05       
##                                           
##             Sensitivity : 0.9493          
##             Specificity : 0.4040          
##          Pos Pred Value : 0.8578          
##          Neg Pred Value : 0.6780          
##              Prevalence : 0.7911          
##          Detection Rate : 0.7511          
##    Detection Prevalence : 0.8755          
##       Balanced Accuracy : 0.6767          
##                                           
##        'Positive' Class : isODA           
## 

Conclusion

According to the rank of the overall predication accuracy, the penalized logit model and boosting model perform the best in predicting whether the project is an ODA or not with the highest accuracy. Meanwhile notably, neural network model has the highest balanced accuracy among all models. With a relatively high level of prediction accuracy, this study will help categorize Chinese foreign investment in the future. As China does not classify its foreign investment by the OECD standard, it is very hard to directly compare its investment efficiency with aid efficiency of Western countries or that of the World Bank. Therefore, this study would, to a certain degree, help address this problem, and provide a benchmark for researchers to compare Chinese and Western foreign investment in the right categories.